Table of Contents:

  • Problem Statement

  • Imports/packages

  • EDA (Exploratory Data Analysis)

    • Distributions, graphs, etc.
  • Data Preprocessing

  • Model Creation/Testing

Problem Statement:

The task of the competition is to predict the temperature at the next hour, given the weather conditions and temperature for prior hours. This is a multivariate time-series forecasting problem, and we take a neural-network approach to tackle it.

In [5]:
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
import plotly as py
import plotly.graph_objs as go
%matplotlib inline

EDA

In [6]:
X_train = pd.read_csv("climate_hour_train.csv")
In [7]:
X_train.head()   # peek at the data
Out[7]:
Date Time p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%) VPmax (mbar) VPact (mbar) VPdef (mbar) sh (g/kg) H2OC (mmol/mol) rho (g/m**3) wv (m/s) max. wv (m/s) wd (deg)
0 01.01.2009 01:00:00 996.50 -8.05 265.38 -8.78 94.4 3.33 3.14 0.19 1.96 3.15 1307.86 0.21 0.63 192.7
1 01.01.2009 02:00:00 996.62 -8.88 264.54 -9.77 93.2 3.12 2.90 0.21 1.81 2.91 1312.25 0.25 0.63 190.3
2 01.01.2009 03:00:00 996.84 -8.81 264.59 -9.66 93.5 3.13 2.93 0.20 1.83 2.94 1312.18 0.18 0.63 167.2
3 01.01.2009 04:00:00 996.99 -9.05 264.34 -10.02 92.6 3.07 2.85 0.23 1.78 2.85 1313.61 0.10 0.38 240.0
4 01.01.2009 05:00:00 997.46 -9.63 263.72 -10.65 92.2 2.94 2.71 0.23 1.69 2.71 1317.19 0.40 0.88 157.0

Below are some basic summary statistics:

In [8]:
X_train.describe()
Out[8]:
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%) VPmax (mbar) VPact (mbar) VPdef (mbar) sh (g/kg) H2OC (mmol/mol) rho (g/m**3) wv (m/s) max. wv (m/s) wd (deg)
count 52566.000000 52566.000000 52566.000000 52566.000000 52566.000000 52566.000000 52566.000000 52566.000000 52566.000000 52566.000000 52566.000000 52566.000000 52566.000000 52566.000000
mean 988.723002 9.172795 283.254265 4.779049 76.444300 13.357483 9.458133 3.899249 5.977212 9.568031 1216.718989 2.142170 3.539017 173.689628
std 8.190684 8.533081 8.605048 6.922701 16.430164 7.572008 4.201679 4.723265 2.666892 4.253017 40.439912 1.530832 2.313246 87.251111
min 918.500000 -22.760000 250.850000 -24.800000 13.060000 0.970000 0.810000 0.000000 0.510000 0.810000 1066.190000 0.000000 0.000000 0.000000
25% 983.750000 3.110000 277.242500 0.130000 65.810000 7.640000 6.170000 0.810000 3.890000 6.240000 1188.082500 1.010000 1.800000 120.800000
50% 989.140000 9.310000 283.430000 5.200000 79.700000 11.740000 8.850000 2.090000 5.595000 8.965000 1213.440000 1.790000 3.000000 197.100000
75% 994.070000 15.280000 289.370000 10.030000 89.800000 17.390000 12.320000 5.130000 7.780000 12.450000 1243.050000 2.880000 4.750000 233.800000
max 1012.740000 35.480000 309.690000 22.940000 100.000000 57.800000 28.040000 41.710000 17.940000 28.530000 1392.560000 12.580000 20.330000 360.000000

Let's check for null values with a heatmap. Any missing data (NaN or null values) will be highlighted in yellow:

In [9]:
sns.heatmap(X_train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a265d84e0>
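A quicker, non-graphical check is to count nulls per column with `isnull().sum()`; a minimal sketch, using a toy frame in place of the real `X_train`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train; on the real data only the last line is needed.
df = pd.DataFrame({"T (degC)": [-8.05, np.nan, -8.81],
                   "p (mbar)": [996.50, 996.62, 996.84]})

null_counts = df.isnull().sum()   # number of missing values per column
print(null_counts)
```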

Distributions

Below we have distributions for some of the variables. On the top left is temperature, our target variable, in Celsius. We can see that the distribution is slightly bimodal but fairly normal overall. We also plot some variables that would carry more meaning for a meteorologist; visually, their distributions are fairly similar to each other.

In [11]:
sns.set(style="white", palette="muted", color_codes=True)

# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(7,7))
sns.despine(left=True)

# Plot a simple histogram with binsize determined automatically
sns.distplot(X_train['T (degC)'], kde=False, color="b", ax=axes[0, 0])

# Plot a kernel density estimate and rug plot
sns.distplot(X_train['H2OC (mmol/mol)'], hist=False, rug=True, color="r", ax=axes[0, 1])

# Plot a filled kernel density estimate
sns.distplot(X_train['VPact (mbar)'], hist=False, color="g", kde_kws={"shade": True}, ax=axes[1, 0])

# Plot a histogram and kernel density estimate
sns.distplot(X_train['VPmax (mbar)'], color="m", ax=axes[1, 1])

plt.setp(axes, yticks=[])
plt.tight_layout()

In the pairplot below, we can see the pairwise relationships between all the variables. Looking at temperature specifically (the second row), most predictors are fairly correlated with it, with the exception of a few weaker ones.

In [3]:
sns.pairplot(X_train.drop("Date Time",axis=1))
Out[3]:
<seaborn.axisgrid.PairGrid at 0x1a1a534e80>

Alternatively, for a more intuitive view over time (this being a time-series problem), we can plot the temperature change (in Celsius) from the first date of the training set to the last. Since the data was already in order by date, I used the index for the x-axis of this graph.

In [16]:
plt.figure(figsize=(22,6))
sns.lineplot(x=X_train.index, y=X_train['T (degC)'])
plt.title('Temperature Change Over Time')
plt.show()

Let's zoom in on the first 24 hours, since that's the window we'll be timestepping over:

In [18]:
plt.plot(range(24), X_train["T (degC)"][:24])
Out[18]:
[<matplotlib.lines.Line2D at 0x1a207a3438>]

We know that the dataset starts at midnight, which accounts for the lower temperatures (the sun isn't up yet), and we can see the temperature gradually rise over the day.
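The daily cycle can be made explicit by grouping temperature on the hour parsed from `Date Time`; a sketch assuming the date format shown in the head above (`%d.%m.%Y %H:%M:%S`), demonstrated on a few toy rows:

```python
import pandas as pd

# Toy rows in the same format as climate_hour_train.csv
df = pd.DataFrame({
    "Date Time": ["01.01.2009 01:00:00", "01.01.2009 13:00:00",
                  "02.01.2009 01:00:00", "02.01.2009 13:00:00"],
    "T (degC)": [-8.0, -2.0, -9.0, -1.0],
})

# Parse the timestamp and average temperature by hour of day
hour = pd.to_datetime(df["Date Time"], format="%d.%m.%Y %H:%M:%S").dt.hour
mean_by_hour = df["T (degC)"].groupby(hour).mean()
print(mean_by_hour)
```

On the full training set this would trace the average diurnal temperature curve.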

Data Pre-Processing

In [ ]:
from scipy import stats
pd.set_option('display.max_columns', None)
import tensorflow
In [2]:
X_train = pd.read_csv("climate_hour_train.csv")
X_test1 = pd.read_csv("climate_Xtest.csv",header=None)
X_test2 = pd.read_csv("climate_Xtest2.csv",header=None)
X_test3 = pd.read_csv("climate_Xtest3.csv",header=None)  

# Had to split the test file into 3 separate files and concatenate them because
# the original file was too big for IBM Cognitive Class
In [3]:
numeric_cols = X_train.select_dtypes(include=[np.number]).columns
X_train_Norm = X_train[numeric_cols].apply(stats.zscore)        #Z score normalization train
In [4]:
X_test = pd.concat([X_test1, X_test2], axis = 1)
X_test = pd.concat([X_test, X_test3], axis = 1)  
X_test = X_test.apply(stats.zscore)       #Z score normalization test
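Note that the test set above is normalized with its own mean and standard deviation; the more common convention is to reuse the training statistics so train and test share one scale. A minimal sketch of that alternative (not the notebook's code):

```python
import pandas as pd

train = pd.DataFrame({"x": [0.0, 2.0, 4.0]})
test = pd.DataFrame({"x": [1.0, 3.0]})

mu, sigma = train.mean(), train.std(ddof=0)   # ddof=0 matches scipy.stats.zscore
train_norm = (train - mu) / sigma
test_norm = (test - mu) / sigma               # test scaled with TRAIN statistics
print(test_norm)
```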
In [6]:
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
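A small worked example makes the column layout of `series_to_supervised` clear. With `n_in=0, n_out=3` each row holds three consecutive timesteps (t, t+1, t+2), with the columns grouped by timestep and then by variable. The helper is repeated here so the snippet runs standalone:

```python
import pandas as pd

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # same framing helper as defined above
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = list(), list()
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg

toy = pd.DataFrame({'a': [10, 11, 12, 13], 'b': [20, 21, 22, 23]})
framed = series_to_supervised(toy, n_in=0, n_out=3)
print(framed.columns.tolist())
# columns: var1(t), var2(t), var1(t+1), var2(t+1), var1(t+2), var2(t+2)
print(framed.shape)
```

The last two rows are dropped because shifting forward leaves them with NaNs, so 4 input rows become 2 framed rows of 6 columns.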
In [7]:
x_train = series_to_supervised(X_train_Norm,0,24,True)   # frame 24 timesteps (t..t+23) per row
In [8]:
x_train.drop(52542,axis=0,inplace=True)   # drop one row so the sample count reshapes evenly
In [9]:
y_train = np.asarray(x_train['var2(t+23)'])     # target: temperature (var2) at the final timestep
x_train = np.asarray(x_train)
In [10]:
x_train = x_train.reshape((52542,24,14))     # (samples, timesteps, features) for Keras
y_train = y_train.reshape((y_train.shape[0], 1))
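The reshape to `(samples, 24, 14)` works because `series_to_supervised` orders columns by timestep first and variable second; a toy check of that ordering (1 sample, 2 timesteps, 3 variables):

```python
import numpy as np

# Flattened the way the framing produces it:
# [v1(t), v2(t), v3(t), v1(t+1), v2(t+1), v3(t+1)]
flat = np.array([[1, 2, 3, 4, 5, 6]])

seq = flat.reshape((1, 2, 3))
print(seq[0, 0])   # timestep t   -> [1 2 3]
print(seq[0, 1])   # timestep t+1 -> [4 5 6]
```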
In [11]:
y_train.shape
Out[11]:
(52542, 1)

Model Creation/Testing

In [12]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
In [36]:
data_dim = 14
timesteps = 24
model = Sequential()
model.add(LSTM(80,input_shape=(timesteps,data_dim)))  # 80 units seems to be the sweet spot
model.add(Dropout(0.06))                              # dropout rates above ~0.1 gave worse results
model.add(Dense(1, activation='linear'))              # linear activation gives the best results

# the simplest architecture seems to work best

model.compile(loss='mae',            # MAE loss; also tried SGD, Adadelta, etc. for the optimizer
              optimizer='adam',
              metrics=['accuracy'])  # note: accuracy is not meaningful for regression; watch the loss
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_6 (LSTM)                (None, 80)                30400     
_________________________________________________________________
dropout_6 (Dropout)          (None, 80)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 81        
=================================================================
Total params: 30,481
Trainable params: 30,481
Non-trainable params: 0
_________________________________________________________________
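The 30,400 LSTM parameters in the summary can be checked by hand: an LSTM has four gates, each with a weight matrix over the input, a recurrent matrix over the hidden state, and a bias vector, giving 4 * (units * (units + inputs) + units):

```python
units, inputs = 80, 14

lstm_params = 4 * (units * (units + inputs) + units)   # four gates: input, forget, cell, output
dense_params = units * 1 + 1                           # weights + bias for the single output

print(lstm_params, dense_params, lstm_params + dense_params)   # 30400 81 30481
```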
In [41]:
model.fit(x_train, y_train, batch_size=80, epochs=25,shuffle=False)    # tried a few settings; a batch size of 80 seems to be the best
Epoch 1/25
52542/52542 [==============================] - 39s 746us/step - loss: 0.0216 - acc: 0.0000e+00
Epoch 2/25
52542/52542 [==============================] - 39s 742us/step - loss: 0.0221 - acc: 0.0000e+00
Epoch 3/25
52542/52542 [==============================] - 39s 739us/step - loss: 0.0219 - acc: 0.0000e+00
Epoch 4/25
52542/52542 [==============================] - 39s 747us/step - loss: 0.0214 - acc: 0.0000e+00
Epoch 5/25
52542/52542 [==============================] - 39s 745us/step - loss: 0.0220 - acc: 0.0000e+00
Epoch 6/25
52542/52542 [==============================] - 40s 762us/step - loss: 0.0211 - acc: 0.0000e+00
Epoch 7/25
52542/52542 [==============================] - 40s 760us/step - loss: 0.0207 - acc: 0.0000e+00
Epoch 8/25
52542/52542 [==============================] - 41s 778us/step - loss: 0.0220 - acc: 0.0000e+00
Epoch 9/25
52542/52542 [==============================] - 41s 775us/step - loss: 0.0213 - acc: 0.0000e+00
Epoch 10/25
52542/52542 [==============================] - 41s 771us/step - loss: 0.0211 - acc: 0.0000e+00
Epoch 11/25
52542/52542 [==============================] - 41s 774us/step - loss: 0.0208 - acc: 0.0000e+00
Epoch 12/25
52542/52542 [==============================] - 41s 774us/step - loss: 0.0213 - acc: 0.0000e+00
Epoch 13/25
52542/52542 [==============================] - 41s 775us/step - loss: 0.0209 - acc: 0.0000e+00
Epoch 14/25
52542/52542 [==============================] - 42s 790us/step - loss: 0.0212 - acc: 0.0000e+00
Epoch 15/25
52542/52542 [==============================] - 41s 782us/step - loss: 0.0210 - acc: 0.0000e+00
Epoch 16/25
52542/52542 [==============================] - 40s 758us/step - loss: 0.0209 - acc: 0.0000e+00
Epoch 17/25
52542/52542 [==============================] - 41s 781us/step - loss: 0.0211 - acc: 0.0000e+00
Epoch 18/25
52542/52542 [==============================] - 41s 777us/step - loss: 0.0211 - acc: 0.0000e+00
Epoch 19/25
52542/52542 [==============================] - 41s 771us/step - loss: 0.0203 - acc: 0.0000e+00
Epoch 20/25
52542/52542 [==============================] - 41s 778us/step - loss: 0.0217 - acc: 0.0000e+00
Epoch 21/25
52542/52542 [==============================] - 42s 805us/step - loss: 0.0207 - acc: 0.0000e+00
Epoch 22/25
52542/52542 [==============================] - 41s 784us/step - loss: 0.0206 - acc: 0.0000e+00
Epoch 23/25
52542/52542 [==============================] - 41s 786us/step - loss: 0.0218 - acc: 0.0000e+00
Epoch 24/25
52542/52542 [==============================] - 41s 773us/step - loss: 0.0204 - acc: 0.0000e+00
Epoch 25/25
52542/52542 [==============================] - 41s 776us/step - loss: 0.0216 - acc: 0.0000e+00
Out[41]:
<tensorflow.python.keras.callbacks.History at 0x3eff6c3ebf60>
In [42]:
predictions = pd.DataFrame(model.predict(np.asarray(X_test).reshape((17447,24,14)),batch_size=80)).apply(lambda x: (x * 8.01432) + 10.25075)
# final predictions, converted back from normalized values to degrees Celsius (x * std + mean)
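The `(x * 8.01432) + 10.25075` step above inverts the z-score transform: multiply by the standard deviation and add back the mean. A minimal round-trip sketch of the idea, using illustrative statistics computed from toy data rather than the notebook's constants:

```python
import numpy as np

temps = np.array([-8.05, -8.88, 9.31, 35.48])
mu, sigma = temps.mean(), temps.std()

normed = (temps - mu) / sigma          # forward z-score normalization
restored = normed * sigma + mu         # inverse: multiply by std, add mean
print(np.allclose(restored, temps))    # True
```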
In [43]:
predictions.to_csv("predictions.csv")
In [2]:
# per-epoch training loss recorded from the run above
loss = [0.0216,0.0221,0.0219,0.0214,0.0220,0.0211,0.0207,0.0220,0.0213,0.0211,0.0208,0.0213,0.0209,0.0212,
0.0210,0.0209,0.0211,0.0211,0.0203,0.0217,0.0207,0.0206,0.0218,0.0204,0.0216]
In [16]:
sns.lineplot(range(1,26),loss)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a22b2c860>
In [21]:
loss2 = [0.0960,0.0326,0.0248,0.0220,0.0222,0.0212,0.0205,0.0207,0.0202,0.0210,0.0194,0.0201,0.0200,0.0200
,0.0198,0.0194,0.0197,0.0195,0.0192,0.0198,0.0191,0.0189,0.0199,0.0192,0.0188]
In [23]:
plt.ylim(0.018,0.024)
sns.lineplot(range(1,26),loss2)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2340d550>
In [ ]: